1 Executive Summary

This document has notes on LiveLab1 worksheet on R code and investigation on the NYC Water Quality dataset. It starts with initial data analysis exploration, exploring the dataset’s numeric variables, missingness, and outliers, where it should be noted that this dataset has high missingness in the Fluoride (mg/L) variable, and might not be useful for insight regarding Fluoride levels in water samples in the dataset. Types of variables and time period of record were also extracted, and basic modification of dataset in regards to time variables were performed, where new variables were mutated and format was changed.

It then investigate questions in regards to Turbidity, Chlorine and Fluoride levels across Sample Sites, Sample Classes and time. Data wrangling tools such as select(),filter(),group_by() were used to subset and reshape the dataset to provide insights. Visualisation was done using ggplot to produce histograms, boxplots and line graphs to present data trends and comparisons.

A student attempt at practicing data wrangling skills was done at the end to investigate the question of how chlorine readings in different sample classes change over time. Main discovery was the major difference in levels from the Compliance and Operational sample class, where the Operational sample class has a higher range of chlorine levels.

2 Load in packages and read in data

library(tidyverse)
library(lubridate)
library(naniar)
library(dplyr) 

data <- read_csv("NYC_Drinking_Water.csv")
data
## # A tibble: 72,709 x 10
##    `Sample Date` `Sample Time` `Sample Site` `Sample class` Location
##    <chr>         <time>        <chr>         <chr>          <chr>   
##  1 01/01/2015    12:19         1S07          Operational    "SS - S…
##  2 01/01/2015    11:15         1S04          Operational    "SS - S…
##  3 01/01/2015    10:09         1S03A         Operational    "SS - S…
##  4 01/01/2015    10:41         1S03B         Operational    "SS - S…
##  5 01/01/2015    09:38         11550         Compliance     "SS - I…
##  6 01/01/2015    08:41         13850         Compliance     "SS - I…
##  7 01/01/2015    09:07         15550         Compliance     "SS - I…
##  8 01/01/2015    11:16         17550         Compliance     "SS - I…
##  9 01/01/2015    07:41         20900         Operational    "SS - I…
## 10 01/01/2015    08:11         21850         Compliance     "SS - I…
## # … with 72,699 more rows, and 5 more variables: `Residual Free Chlorine
## #   (mg/L)` <dbl>, `Turbidity (NTU)` <dbl>, `Fluoride (mg/L)` <dbl>, `Coliform
## #   (Quanti-Tray) (MPN /100mL)` <chr>, `E.coli(Quanti-Tray) (MPN/100mL)` <chr>

Note how the Sample Date variable is of chr class, we will change it to date format later on. Also note how fluoride has a lot of NA values that we will have to clean up.

3 Exploratory Data Analysis

The dataset has 72,709 rows with 10 variables, with variable names listed here:

names(data)
##  [1] "Sample Date"                         "Sample Time"                        
##  [3] "Sample Site"                         "Sample class"                       
##  [5] "Location"                            "Residual Free Chlorine (mg/L)"      
##  [7] "Turbidity (NTU)"                     "Fluoride (mg/L)"                    
##  [9] "Coliform (Quanti-Tray) (MPN /100mL)" "E.coli(Quanti-Tray) (MPN/100mL)"

3.1 Visualising numeric variables

By using pipe %>% we are not changing the actual dataframe, it shows how the original dataset is processed by functions to produce graphical / numerical summaries.

  • select_if(is.numeric) select the variables that are of numeric class
  • pivot_longer() convert the dataset to longer format so that each observation is now 3 rows, each of the Fluoride, Chlorine and Turbidity variables
  • Visualisation: ggplot(aes(value)) lets us generate plots based on values, facet_wrap() lets us plot a different graph for each variable. geom_histogram() plots the histogram data on to the plot. theme_bw() and theme() customises the plot.
data %>%  
  select_if(is.numeric) %>% 
  pivot_longer(everything(),names_to = "variable", values_to = "value") %>% 
  ggplot(aes(value)) +
  facet_wrap(~ variable, scales = "free") +
  geom_histogram() +
  theme_bw(base_size = 18) +
  theme(strip.text.x = element_text(size = 8))

3.2 Explore missingness using naniar package

Using the naniar package, explore if any variables are substantially missing.

vis_miss(data, warn_large_data = FALSE)

Here shows that Turbidity has a small amount of missingness and that the Fluoride variable has a lot of missingness, knowing this we should keep in mind that this will affect our ability to analyse the Fluoride variable data.

3.3 Exploring outliers and skewness

Summary statistics:

data %>%
  select_if(is.numeric) %>% 
  pivot_longer(everything(),names_to = "variable", values_to = "value") %>% 
  group_by(variable) %>% 
  summarise(mean = mean(value, na.rm = TRUE), median = median(value, na.rm = TRUE))
## # A tibble: 3 x 3
##   variable                       mean median
##   <chr>                         <dbl>  <dbl>
## 1 Fluoride (mg/L)               0.714   0.71
## 2 Residual Free Chlorine (mg/L) 0.584   0.59
## 3 Turbidity (NTU)               0.739   0.73
  • group_by(variable) groups the data into Fluoride, Residual Free Chlorine and Turbidity groups. Then summarise() will provide the numerical summary statistics mean, median for each of the 3 groups.
  • na.rm = TRUE removes NA values.
data %>%
  select_if(is.numeric) %>% 
  pivot_longer(everything(),names_to = "variable", values_to = "value") %>% 
  ggplot(aes(value, x = variable)) +
  facet_wrap(~ variable, scales = "free") +
  geom_boxplot() +
  theme_bw(base_size = 18) +
  theme(strip.text.x = element_text(size = 8))

ggplot for each of the variables, this time we create a boxplot with geom_boxplot(). Here we can see Turbidity has some outliers at ~15, ~27, and ~34, where majority of the values are under 10. Similar in Chlorine, where there is a outlier of value -10.0

4 What types of variables do we have in our dataset?

summary() gives summary statistics for variables, the length of data, the mean, quartiles, median, min, max and NA’s. And str() gives the type of variables as well as a sneak peak of data.

summary(data)
##  Sample Date        Sample Time       Sample Site        Sample class      
##  Length:72709       Length:72709      Length:72709       Length:72709      
##  Class :character   Class1:hms        Class :character   Class :character  
##  Mode  :character   Class2:difftime   Mode  :character   Mode  :character  
##                     Mode  :numeric                                         
##                                                                            
##                                                                            
##                                                                            
##    Location         Residual Free Chlorine (mg/L) Turbidity (NTU)  
##  Length:72709       Min.   :-9.9900               Min.   : 0.0700  
##  Class :character   1st Qu.: 0.4500               1st Qu.: 0.6200  
##  Mode  :character   Median : 0.5900               Median : 0.7300  
##                     Mean   : 0.5845               Mean   : 0.7385  
##                     3rd Qu.: 0.7200               3rd Qu.: 0.8600  
##                     Max.   : 2.2000               Max.   :33.8000  
##                     NA's   :3                     NA's   :506      
##  Fluoride (mg/L) Coliform (Quanti-Tray) (MPN /100mL)
##  Min.   :0.03    Length:72709                       
##  1st Qu.:0.69    Class :character                   
##  Median :0.71    Mode  :character                   
##  Mean   :0.71                                       
##  3rd Qu.:0.73                                       
##  Max.   :0.89                                       
##  NA's   :63336                                      
##  E.coli(Quanti-Tray) (MPN/100mL)
##  Length:72709                   
##  Class :character               
##  Mode  :character               
##                                 
##                                 
##                                 
## 
str(data)
## tibble [72,709 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Sample Date                        : chr [1:72709] "01/01/2015" "01/01/2015" "01/01/2015" "01/01/2015" ...
##  $ Sample Time                        : 'hms' num [1:72709] 12:19:00 11:15:00 10:09:00 10:41:00 ...
##   ..- attr(*, "units")= chr "secs"
##  $ Sample Site                        : chr [1:72709] "1S07" "1S04" "1S03A" "1S03B" ...
##  $ Sample class                       : chr [1:72709] "Operational" "Operational" "Operational" "Operational" ...
##  $ Location                           : chr [1:72709] "SS - Shaft 7 of City Tunnel No. 1 - W/S Sedgwick Ave OPP W 167th St (Tun 1)" "SS - Shaft 4 of City Tunnel No.1 - IFO 2780 Reservoir Ave, E/S Reservoir Ave,\n1st SS N/O Strong Street, at the"| __truncated__ "SS - Shaft 3A of City Tunnel No. 2 - IFO 823 S/S E 233rd St, W/O Bronxwood Ave" "SS - Shaft 3B of City Tunnel No. 3 - Mosholu Ave, W/O Jerome Ave" ...
##  $ Residual Free Chlorine (mg/L)      : num [1:72709] 0.58 0.71 0.79 0.77 0.74 0.59 0.54 0.56 0.54 0.55 ...
##  $ Turbidity (NTU)                    : num [1:72709] 0.96 0.94 0.93 0.93 0.95 1.08 0.9 1 0.92 0.91 ...
##  $ Fluoride (mg/L)                    : num [1:72709] 0.79 0.8 0.79 0.8 NA NA NA NA NA NA ...
##  $ Coliform (Quanti-Tray) (MPN /100mL): chr [1:72709] "<1" "<1" "<1" "<1" ...
##  $ E.coli(Quanti-Tray) (MPN/100mL)    : chr [1:72709] "<1" "<1" "<1" "<1" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `Sample Date` = col_character(),
##   ..   `Sample Time` = col_time(format = ""),
##   ..   `Sample Site` = col_character(),
##   ..   `Sample class` = col_character(),
##   ..   Location = col_character(),
##   ..   `Residual Free Chlorine (mg/L)` = col_double(),
##   ..   `Turbidity (NTU)` = col_double(),
##   ..   `Fluoride (mg/L)` = col_double(),
##   ..   `Coliform (Quanti-Tray) (MPN /100mL)` = col_character(),
##   ..   `E.coli(Quanti-Tray) (MPN/100mL)` = col_character()
##   .. )

This dataset has data types character, numeric, hms (Time). Note that Sample Date can be converted to date format.

5 What are the range of dates for which this data was collected?

data$`Sample Date` %>% min()
## [1] "01/01/2015"
data$`Sample Date` %>% max()
## [1] "12/31/2018"

This data has date ranges from 1st January, 2015 to 31st December, 2018, given by the minimum and maximum value under the Sample Date variable.

6 Basic modification of dataset

6.1 Add information about month of year, weekday, and week of year

First convert Sample Date to appropriate lubridate format in month-date-year date format by using mdy(‘Sample Date’) (see how it changes from “chr” to “date” variable), here the same name ‘Sample Date’ is used to replace the variable (changing the variable type).

data <- data %>% 
        mutate(`Sample Date` = mdy(`Sample Date`))

Add in extra variables: using hour function from lubridate package to create new variable ‘Hour’ from ‘Sample Time’, similarly for ‘Week of Year’, ‘Weekday’ and ‘Month’

  • lubridate::hour() tells R to use the hour() from lubridate package.

  • levels = month.name in the mutate() function puts the data in order of months from January to December instead of alphabetical.

data <- data %>%
        mutate(Hour = lubridate::hour(`Sample Time`), `Week of Year` = week(`Sample Date`),  `Weekday` = wday(`Sample Date`), `Month` = month(`Sample Date`))

# Extension - add order to the months

data <- data %>% 
  mutate(Month = factor(month.name[Month], levels = month.name)) 

data
## # A tibble: 72,709 x 14
##    `Sample Date` `Sample Time` `Sample Site` `Sample class` Location
##    <date>        <time>        <chr>         <chr>          <chr>   
##  1 2015-01-01    12:19         1S07          Operational    "SS - S…
##  2 2015-01-01    11:15         1S04          Operational    "SS - S…
##  3 2015-01-01    10:09         1S03A         Operational    "SS - S…
##  4 2015-01-01    10:41         1S03B         Operational    "SS - S…
##  5 2015-01-01    09:38         11550         Compliance     "SS - I…
##  6 2015-01-01    08:41         13850         Compliance     "SS - I…
##  7 2015-01-01    09:07         15550         Compliance     "SS - I…
##  8 2015-01-01    11:16         17550         Compliance     "SS - I…
##  9 2015-01-01    07:41         20900         Operational    "SS - I…
## 10 2015-01-01    08:11         21850         Compliance     "SS - I…
## # … with 72,699 more rows, and 9 more variables: `Residual Free Chlorine
## #   (mg/L)` <dbl>, `Turbidity (NTU)` <dbl>, `Fluoride (mg/L)` <dbl>, `Coliform
## #   (Quanti-Tray) (MPN /100mL)` <chr>, `E.coli(Quanti-Tray) (MPN/100mL)` <chr>,
## #   Hour <int>, `Week of Year` <dbl>, Weekday <dbl>, Month <fct>

Now the dataset has new variables ‘Hour’, ‘Week of Year’, ‘Weekday’, ‘Month’

7 Interrogating the data - wrangle the data to answer these questions!

7.1 Which date had the highest Turbidity reading?

data %>% 
  arrange(desc(`Turbidity (NTU)`)) %>% 
  select(`Turbidity (NTU)`, `Sample Date`)
## # A tibble: 72,709 x 2
##    `Turbidity (NTU)` `Sample Date`
##                <dbl> <date>       
##  1             33.8  2018-01-16   
##  2             27.1  2017-05-19   
##  3             14.1  2019-05-17   
##  4              6.97 2017-12-15   
##  5              6.97 2018-01-23   
##  6              6.68 2017-11-27   
##  7              6.29 2017-12-13   
##  8              6.04 2016-05-16   
##  9              5.96 2017-12-29   
## 10              5.89 2018-02-21   
## # … with 72,699 more rows

16th January, 2018 had the highest Turbidity with a reading of 33.80 NTU. Here we use arrange() to list the Turbidity values in descending order, and select the variable columns that we are interested in with select().

7.2 Which date had the highest reading of Residual Free Chlorine?

data %>%
  arrange(desc(`Residual Free Chlorine (mg/L)`)) %>% 
  select(`Residual Free Chlorine (mg/L)`, `Sample Date`)
## # A tibble: 72,709 x 2
##    `Residual Free Chlorine (mg/L)` `Sample Date`
##                              <dbl> <date>       
##  1                            2.2  2016-08-22   
##  2                            2.2  2016-11-02   
##  3                            1.8  2015-10-24   
##  4                            1.76 2017-03-25   
##  5                            1.56 2015-05-13   
##  6                            1.46 2015-05-18   
##  7                            1.46 2015-09-29   
##  8                            1.42 2015-10-02   
##  9                            1.42 2015-10-24   
## 10                            1.37 2015-09-07   
## # … with 72,699 more rows

22nd August, 2016 had the highest reading of Residual Free Chlorine with a reading of 2.2 mg/L.

7.3 Is there a difference between the median readings for Turbidity, Chlorine, and Fluoride for the different types of sample sites?

data %>% 
  group_by(`Sample class`) %>%
  summarise(med_chlorine = median(`Residual Free Chlorine (mg/L)`),
            med_turbidity = median(`Turbidity (NTU)`, na.rm = TRUE),
            med_flouride = median(`Fluoride (mg/L)`, na.rm = TRUE))
## # A tibble: 7 x 4
##   `Sample class`       med_chlorine med_turbidity med_flouride
##   <chr>                       <dbl>         <dbl>        <dbl>
## 1 Compliance                  NA             0.72         0.71
## 2 Entry Point                  0.63          0.72         0.72
## 3 Op-resample                  0.73          0.7          0.72
## 4 Operational                  0.69          0.75         0.71
## 5 Re-Sample                    0.39          0.67        NA   
## 6 Resample_Compliance          0.47          0.7         NA   
## 7 Resample_Operational         0.75          0.98        NA

Summary statistics for each of the Sample Classes listing their respective median readings for Turbidity, Chlorine, Fluoride. - Median readings for Chlorine spreads over the range 0.39 to 0.75 - Median readings for Turbidity has a narrower range around 0.70, with outlier median reading for Resample_Operational having 0.98 - Median readings for Fluoride are from 0.71-0.72, but also has NA values for 3 of the resample classes.

Notice how there are NA values, this might be because of the NA values in the dataset and the function fails to compute a median value out of NA values.

7.4 Create a boxplot to visualise the difference between Entry Point and Operational levels of Residual Free Chlorine.

Restrict the data to the 2 sample class of interest (Entry Point, Operational) by filter().

data %>% 
  filter(`Sample class` == "Entry Point" | `Sample class` == "Operational") %>% 
  ggplot(aes(x = `Sample class`, y = `Residual Free Chlorine (mg/L)`, fill = `Sample class` )) +
  geom_boxplot() +
  ggtitle("Residual Free Chlorine (mg/L) for Different Sample Classes") +
  theme_minimal() +
  theme(legend.position = "none")

We notice that there is higher variability in the Operational samples, and the classes has a similar median.

7.5 Which sample sites have the highest and lowest median readings for each chemical?

Here we create a new dataframe sample_summary where the dataset is grouped by different sample sites.

sample_summary <- data %>% 
  group_by(`Sample Site`) %>%
  summarise(med_chlorine = median(`Residual Free Chlorine (mg/L)`, na.rm = TRUE),
            med_turbidity = median(`Turbidity (NTU)`, na.rm = TRUE),
            med_fluoride = median(`Fluoride (mg/L)`, na.rm = TRUE))

Arrange the data by their median for Turbidity in descending order. n() is the number of observation. Extracting the top and bottom one in the arranged order list will give us the highest and lowest reading for the variable. Use drop_na() to ignore NA values.

sample_summary %>% 
  select(`Sample Site`, med_turbidity) %>% 
  drop_na() %>%
  arrange(desc(med_turbidity)) %>% 
  filter(row_number() %in% c(1, n()))
## # A tibble: 2 x 2
##   `Sample Site` med_turbidity
##   <chr>                 <dbl>
## 1 51900                  1.03
## 2 3SC26                  0.45

For Turbidity, sample site 51900 has the highest reading of 1.03 and sample site 3SC26 has the lowest reading of 0.45.

sample_summary %>%
  select(`Sample Site`, med_fluoride) %>%
  drop_na() %>%
  arrange(desc(med_fluoride)) %>% 
  filter(row_number() %in% c(1, n()))
## # A tibble: 2 x 2
##   `Sample Site` med_fluoride
##   <chr>                <dbl>
## 1 1S04                  0.81
## 2 32350                 0.69

For Fluoride, sample site 1S04 has the highest reading of 0.81 and sample site 32350 has the lowest reading of 0.69.

sample_summary %>% 
  select(`Sample Site`, med_chlorine) %>% 
  drop_na() %>%
  arrange(desc(med_chlorine)) %>% 
  filter(row_number() %in% c(1, n()))
## # A tibble: 2 x 2
##   `Sample Site` med_chlorine
##   <chr>                <dbl>
## 1 1S03A                 0.91
## 2 77750                 0.11

For Chlorine, sample site 1S03A has the highest reading of 0.91 and sample site 77750 has the lowest reading of 0.11.

7.6 Visualise the difference in readings between the top and bottom sites for Turbidity in different ways. Can you find anything interesting about the sites?

sites <- sample_summary %>% 
  select(`Sample Site`, med_turbidity) %>% 
  arrange(desc(med_turbidity)) %>% 
  filter(row_number() %in% c(1, n()))

data %>% 
  filter(`Sample Site` %in% pull(sites, `Sample Site`)) %>% 
  ggplot(aes(x = `Turbidity (NTU)`, fill = `Sample Site`)) +
  geom_histogram()

Sample site 51900 has the highest median reading, and sample site 3SC26 has lowest median reading for Turbidity. Notice how 51900 has a outlier of value greater than 3.5 NTU.

Line plot showing how the Turbidity reading change over time.

data %>% 
  filter(`Sample Site` %in% pull(sites, `Sample Site`)) %>%
  ggplot(aes(x = `Sample Date`, y = `Turbidity (NTU)`, color = `Sample Site`))+
  geom_line()

Interesting, it seems the site 51900 with the higher median turbidity reading is no longer operational, see how the reading/line stops early 2017.

7.7 How have the median readings for each of the chemicals changed over time?

data %>% 
  select(`Sample Date`, `Residual Free Chlorine (mg/L)`, `Turbidity (NTU)`, `Fluoride (mg/L)`) %>% 
  group_by(`Sample Date`) %>% 
  summarise(med_cl = median(`Residual Free Chlorine (mg/L)`, na.rm = TRUE), med_t = median(`Turbidity (NTU)`, na.rm = TRUE), med_f = median( `Fluoride (mg/L)`, na.rm = TRUE)) %>% 
  pivot_longer(- `Sample Date`, names_to = "chemical", values_to = "values") %>% 
  ggplot(aes(x = `Sample Date`, y = values, group = chemical, colour = chemical)) + geom_line()

There seems to be an inverse relationship between Turbidity and Residual free chlorine readings. Why is this? (Hint - consult the Water Quality Report!)

  • Chlorine is used for disinfection in water treatment to kill germs and bacterias, as Chlorine is introduced and increased in water samples, Turbidity levels are decreased as there are less contaminants after disinfection.

8 Time to research your own questions about the data!

8.1 How does the chlorine reading in different sample classes change over time?

We will use the median Residual Free Chlorine readings as a summary statistic to conclude records.

samplesites <- data %>% 
  group_by(`Sample Site`) %>%
  summarise(med_chlorine = median(`Residual Free Chlorine (mg/L)`, na.rm = TRUE))

sites <- samplesites %>% 
  select(`Sample Site`, med_chlorine) %>% 
  arrange(desc(med_chlorine)) %>% 
  filter(row_number() %in% c(1, n()))
            
data %>% 
  filter(`Sample Site` %in% pull(sites, `Sample Site`)) %>%
  ggplot(aes(x = `Sample Date`, y = `Residual Free Chlorine (mg/L)`, color = `Sample class`))+
  geom_line()

Overall, there are two main sample classes that have consistent record, where the Compliance sample class has a chlorine level range from 0.0 to 0.5 mg/L (with a outlier record of around 0.7 mg/L in late 2015), and the Operational sample class has chlorine level range from 0.5 to 1.4 mg/L.

We can see that the Operational sample class has higher chlorine levels compared to the Compliance sample class. This might be due to the difference in how much chlorine is needed for disinfection to these sample classes, where Operational sample class might need more chlorine for disinfection and Compliance need less.